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Abstract 

Gaussian processes allow for flexible specification of prior assumptions of unknown dynamics in state space mod¬ 
els. We present a procedure for efficient Bayesian learning in Gaussian process state space models, where the represen¬ 
tation is formed by projecting the problem onto a set of approximate eigenfunctions derived from the prior covariance 
structure. Learning under this family of models can be conducted using a carefully crafted particle MCMC algorithm. 
This scheme is computationally efficient and yet allows for a fully Bayesian treatment of the problem. Compared to 
conventional system identification tools or existing learning methods, we show competitive performance and reliable 
quantification of uncertainties in the model. 
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Abstract 

Gaussian processes allow for flexible specification of 
prior assumptions of unknown dynamics in state space 
models. We present a procedure for efficient Bayesian 
learning in Gaussian process state space models, where 
the representation is formed by projecting the problem 
onto a set of approximate eigenfunctions derived from 
the prior covariance structure. Learning under this fam¬ 
ily of models can be conducted using a carefully crafted 
particle MGMG algorithm. This scheme is computation¬ 
ally efficient and yet allows for a fully Bayesian treat¬ 
ment of the problem. Gompared to conventional sys¬ 
tem identification tools or existing learning methods, we 
show competitive performance and reliable quantifica¬ 
tion of uncertainties in the model. 


1 INTRODUCTION 


Gaussian processes (GPs, Rasmussen and Williarns) 
2006} have been proven to be powerful probabilistic 
non-parametric modeling tools for static nonlinear func¬ 
tions. However, many real-world applications, such 
as control, target tracking, and time-series analysis are 
tackling problems with nonlinear dynamical behavior. 
The use of GPs in modeling nonlinear dynamical sys¬ 
tems is an emerging topic, with many strong contribu¬ 
tions during the recent years, for example the work by 
Turner et al. ( 2010| , Frigola et al. (2013 2014a|b| and 
Mattos et al. ( 2016} . The aim of this paper is to ad¬ 
vance the state-of-the-art in Bayesian inference on Gaus¬ 
sian process state space models (GP-SSMs). As we will 
detail, a GP-SSM is a state space model, using a GP as its 
state transition function. Thus, the GP-SSM is not a GP 
itself, but a state space model {i.e., a dynamical system). 


Overviews of GP-SSMs are given by, e.g., McHutchon] 
( [2014} and Frigola-Alcade ( |2015} . 

We provide a novel reduced-rank model formulation 
of the GP-SSM with good convergence properties both in 
theory and practice. The advantage with our approach 
over the variational approach by Frigola et al. ( |2014b} , 
as well as other inducing-point-based approaches, is 
that our approach attempts to approximate the opti¬ 
mal Karhunen-Loeve eigenbasis for the reduced-rank 
approximation instead of using the sub-optimal Nys- 


trom approximation which implicitly is the underly¬ 
ing approximation in all inducing point methods. Be¬ 
cause of this we do not need to resort to variational ap¬ 
proximations, but we can instead perform the Bayesian 
computations in full. By utilizing the structure of the 
reduced-rank model, we construct a computationally ef¬ 
ficient linear-time-complexity MGMG-based algorithm 
for learning in the proposed GP-SSM model, which we 
demonstrate and evaluate on several challenging exam¬ 
ples. We also provide a proof of convergence of the 
reduced-rank GP-SSM to a full GP-SSM (in the supple¬ 
mentary material). 

GP-SSMs are a general class of models defining a dy¬ 
namical system for t = 1,2,..., T given by 


Xf+l =f(Xt)-tW„ 

with f(x) ~ f/T’(0,/«g_y(x,x')), (la) 

Yf =g(xt)-tef, 

with g(x)~gV(O,H 0 ^g(x,x')), (lb) 

where the noise terms and Cf are i.i.d. Gaussian, 
Wf ~ Af(0,Q) and ~ Af(0,R). The latent state Xf e IR"* 
is observed via the measurements yt e IR”^. The key 
feature of this model is the nonlinear transformations 
f : ]R”^ ^ and g : ^ E”j' which are not known 

explicitly and do not adhere to any specific parametriza- 
tion. The model functions f and g are assumed to be 
realizations from a Gaussian process prior over E"^ with 
a given covariance function Kg{x,x') subject to some hy¬ 
perparameters 9. Learning of this model, which we will 
tackle, amounts to inferring the posterior distribution 
of f, g, Q, R, and 9 given a set of (noisy) observations 
yi:r = !yi}f=i- 

The strength of including the GP in ([^ is its ability to 
systematically model uncertainty —not only uncertainty 
originating from stochastic noise within the system, but 
also uncertainty inherited from data, such as few mea¬ 
surements or poor excitation of the dynamics in certain 
regions of the state space. An example of this is given 
by Figurewhere we learn the posterior distribution of 
the unknown function f(-) in a GP-SSM (see Sec. [^for 
details). An inspiring real-world example on how such 
probabilistic information can be utilized for simultane¬ 
ous learning and control is given by jDeisenroth et al. 

dloTs) . 
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(a) The learned model (b) The posterior weights 


Figure 1: An example illustrating how the GP-SSMs handle uncertainty, (a) The learned model from data yi-r- The 
bars show where the data is located in the state space, i.e., what part of the model is excited in the data set, affecting 
the posterior uncertainty in the learned model, (b) Our approach relies on a basis function expansion of /, and 
learning / amounts to finding the posterior distribution of the weights depicted by the histograms. 


Non-probabilistic methods for modeling nonlinear 
dynamical systems include learning of state space mod¬ 
els using a basis function expansion ( Ghahramani and| 
Roweis[|l998[ |, but also nonlinear extensions of AR(MA) 
and GARGH models from the time-series analysis lit¬ 
erature ( |Tsay [201 0| |, as well as nonlinear extensions of 
ARX and state space models from the system identifica¬ 
tion literature (Sjoberg et al. 1995[ Ljungj 1999^ . In par¬ 
ticular, nonlinear ARX models are now a standard tool 


roth and Mohamed||2012[|, and recent interest has been 


in learning GP-SSMs ||Turner et aPj |2010| |Frigola et al. 
2M3l|2014a]b) . An inherent problem in learning the 
GP-SSM is the entangled relationship between the states 
Xf and the nonlinear function f('). Two different ap¬ 
proaches have been proposed in the literature: In the 
first approach the GP is represented by a parametrized 
form ([Turner et al. use a pseudo-training data set, akin 
to the inducing inputs by jPrigola et al.||2014b whereas 
we will employ a basis function expansion). The second 


for the system identification engineer (The MathWorks,| 
!n^[2M5). For probabilistic modeling, the latent force 
model (Alvarez et al. |2009| presents one approach for 
modeling dynamical phenomena using GPs by encoding 
a priori known dynamics within the construction of the 
GP. Another approach is the Gaussian process dynamical 
model ( Wang et al.[ 2008} , where a GP is used to model 
the nonlinear function within an SSM, that is, a GP-SSM. 
However, the work by Wang et al. ( |2008| is, as opposed 
to this paper, mostly focused around the problem set¬ 
ting when Tiy^rix. That is also the focus for the further 


approach (used by Frigola et al. 2013} 2014a[ is han¬ 
dling the nonlinear function implicitly by marginaliz¬ 
ing it out. Goncerning learning. Turner et al.| ( 2010} 
and Frigola et al. ( 2014a| us e an EM-based procedure, 
whereas we and Frigola et al. ( 2013| use an MGMG algo¬ 
rithm. 

The main bottleneck prohibiting the use in practice 
of some of the previously proposed GP-SSMs methods is 
the computational load. For example, the training of a 
one-dimensional system using T = 500 data points {i.e., 
a fairly small example) is in the magnitude of several 
hours for the solution by Frigola et al. ( 2013| . Akin to 
Frigola et al. ( 2014b| , our proposed method will typi¬ 
cally handle such an example within minutes, or even 
less. To reduce the computational load, [Frigola et al. 


( 2014b| suggests variational sparse GP techniques to ap¬ 


proximate the solution. Our approach, however, is using 


development by Damianou et al. ( 2011| , where the EM 
algorithm for learning is replaced by a variational ap¬ 
proach. 

State space filtering and smoothing in GP-SSMs has 
been tackled before (e.g., Deisenroth et al.||2012[ Deisen- 


the reduced-rank GP approximation by Solin and Sarkka 
( 2014| , which is a disparate solution with different prop¬ 
erties. The reduced-rank GP approximation enjoys fa¬ 
vorable theoretical properties, and we can prove conver¬ 
gence to a non-approximated GP-SSM. 

The outline of the paper is as follows: In Section [^ 
we will introduce reduced-rank Gaussian process state 
space models by making use of the representation of 
GPs via basis functions corresponding to the prior co- 
variance structure ( [Solin and Sarkk^ |2014| , a theoreti¬ 
cally well-supported approximation significantly reduc¬ 
ing the computational load. In Section[^we will develop 
an algorithm for learning reduced-rank Gaussian pro¬ 
cess state space models by using recent MGMG methods 
( [Lindsten et al.[|2014j Wills et al. 2012\ . We will also 
demonstrate it on synthetic as well as real data examples 
in Section[^ and finally discuss the contribution and fur¬ 
ther extensions in Section!^ 


2 

















































































































































2 REDUCED-RANK GP-SSMs 


We use GPs as flexible priors in Bayesian learning of the 
state space model. The covariance function k(x,x') en¬ 
codes the prior assumptions of the model functions, thus 
representing the best belief of the behavior of the non¬ 
linear transformations. In the following we present an 
approach for parametrizing this model in terms of an 
OT-rank truncation of a basis function expansion as pre¬ 
sented by Solin and Sarkka | |2014| . Related ideas have 
also been proposed by, for example, |Lazaro-Gredilla 

eTdlldloT^ . 

Provided that the covariance function is stationary 
(homogeneous, i.e. k(x - x') = k(x,x')), the covariance 
function can be equivalently represented in terms of the 
spectral density S(a;). This Fourier duality is known as 
the Wiener-Khintchin theorem, which we parametrize as: 
S(u}) = J K(r) exp(-iu;^r)dr. We employ the relation pre¬ 
sented by Solin and Sarkka | |2014| to approximate the 
covariance operator corresponding to k(-). This operator 
is a pseudo-differential operator, which we approximate 
by a series of differential operators, namely Laplace op¬ 
erators V^. In the isotropic case, the approximation of 
the covariance function is given most concisely in the 
following form: 


as 


K0(x,x') g{\j) (!)''}'> 

;=1 


( 2 ) 


where Sg(-) is the spectral density function of Kgi-), and 
A; and are the Laplace operator eigenvalues and 


eigenfunctions solved for the domain Q 3 x. See [Solin 
and Sarkka ( 2014| for a detailed derivation and conver¬ 
gence proofs. 

The key feature in the Hilbert space approximation 
is that Ay and are independent of the hyperparame¬ 
ters 6 , and it is only the spectral density that depends on 
9. Equation Q is a direct approximation of the eigen- 
decomposition of the Gram matrix {e.g., Rasmussen andl 
Williams|2006| , and it can be interpreted as an optimal 
parametric expansion with respect to the given covari¬ 
ance function in the GP prior. 

In terms of a basis function expansion, this can be ex¬ 
pressed as 


f(x)~gr(o,K(x,x')) ^ 


f(x) ^ ^/W(/)(^)(x), (3) 

;=i 


where ~ Af(0, S(Ay)). In the case > 1, this formu¬ 
lation does allow for non-zero covariance between dif¬ 
ferent components of the state space. We can now for¬ 
mulate a reduced-rank GP-SSM, corresponding to (laI, 


Xf+l = 



Am)- 

■ h 

(/)(l)(Xt) 


Am) 

■ Jnx 

y/)(-)(x,)_ 


-tw,. 


(4) 


A cp(x,) 

and similarly for ( |lb[ |. Henceforth we will consider a 
reduced-rank GP-SSM, 


Xf+i =AO(xt) + Wt, 
Yt = CO{xt) + et, 


(5a) 

(5b) 


where A and C are matrices of weights with priors for 
each element as described by |[^. 

3 LEARNING GP-SSMs 

Learning in reduced-rank Gaussian process state space 
models ^ from yj.f amounts to inferring the posterior 
distribution of A, C, Q, R, and the hyperparameters 9. 
For clarity in the presentation, we will focus on inferring 
the dynamics, and assume the observation model (g(-) 
and R) to be known a priori. However, the extension to 
an unknown observation model—as well as exogenous 
input signals—follows in the same fashion, and will be 
demonstrated in the numerical examples. 

To infer the sought distributions, we will use a blocked 
Gibbs sampler outlined in Algorithm Although in¬ 
volving sequential Monte Garlo (SMG) for inference in 
state space, the validity of this approach is not relying 
on asymptotics {N oo) in the SMG algorithm, thanks 
to recent particle MGMG methods (Lindsten et al. 2014| 
Andrieu et aL[[2010[ |. 


It is possible to learn ([^ under different assumptions 
on what is known. We will focus on the general (and in 
many cases realistic) setting where the distributions of 
A, Q and 9 are all unknown. In cases when Q or 0 are 
known a priori, the presented scheme is straightforward 
to adapt. To be able to infer the posterior distribution of 
Q and 9, we make the additional prior assumptions: 


Q~IW(£q,Aq), 


9-p(9), 


( 6 ) 


where IW denotes the Inverse Wishart distribution. For 
brevity, we will omit the problem of finding the un¬ 
known initial distribution p(xi). It is possible to treat 
this rigorously akin to 9, but it is of minor importance in 
most practical situations. We will now in Section [3T - 3.3 
explain the four main steps[3]-[^in Algorithm[T] 

3.1 Sampling in State Space with SMC 

SMG methods ( [Doucet and Johansenj |2011| are a fam¬ 
ily of techniques developed around the problem of in¬ 
ferring the posterior state distribution in SSMs. SMG 
can be seen as a sequential application of importance 
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Algorithm 1 Learning of reduced-rank GP-SSMs. 


Input: Data yi-x, priors on A, Q and 6 . 

Output: K MCMC-samples with p(xi. f,Q,A,6 \ ynx) as invariant distribution. 


Sample initial xi.'r'[O],Q[O],A[O],0[O]. 
for fc = 0 to X do 
Sample Xi. 7 ’[/c-i-1 
Sample Q[/c-i-l 
Sample A[/c-i-l 
Sample 9[k+l 
end for 


Q[k],A[k],e[k] 
A[k],e[k],xi.x[k +1] 
e[k],xi,T[k+l],Q[k+l] 


by Algorithm 2 
according to | |10| . 
according to ©■ 


xi.j-[k+ l],Q[k-i- l],A[k-i-1] by using MH (Section 


3.3 


sampling along the sequence of distributions ...,p(Xf_i | 
I Yi-.t)’--- with a resampling procedure to 
avoid sample depletion. 

To sample the state space trajectory Xj.y, conditional 
on a model A, Q and data yi-j, we employ a conditional 
particle filter with ancestor sampling, forming a particle 
Gibbs Markov kernel Algorithm[2|(PGAS,|Lindsten et al. 
2 M 4 ) . PGAS can be thought of as an SMG algorithm 
for finding the so-called smoothing distribution pjxj^.y | 
A,Q,yi:'f) to be used within an MGMG procedure. 

3.2 Sampling of Covariances and Weights 


density of the covariance function as its diagonal entries: 

V = diag([S-i(Ai) ••• S-i(A^)]). (8) 

This is how the prior from ([^ enters the formulation. 
(Note that the marginal variance of each element in A is 
also scaled Q, and thereby £q,Aq. For notational conve¬ 
nience, we refrain from introducing a scaling factor, but 
let it be absorbed into the covariance function.) With 
this (conjugate) prior, the posterior follows analytically 
by introducing the following statistics of the sampled 
trajectory Xi-j-: 


The sampling of the weights A and the noise covariance 
Q, conditioned on Xj^.y and 6 , can be done exactly, by 
following the procedure of Wills et al. | |2012[ |. With the 
priors ([^ and |[^, the joint prior of A and Q can be writ¬ 
ten using the Matrix Normal Inverse Wishart (MNIW) 
distribution as 


P(A,Q)=MMIW(A,Q I 0,V,£q,Aq). 




Ztz], 


(9) 


t=i 


t=i 


t=i 


where Ct = ^t+i and Zf = [(j)(i)(xt)...(j)(”')(xt)]^. Using 
the Markov property of the SSM, it is possible to write 


(7) 


the conditional distribution for Q as (Wills et al. 
Eq. (42)): 


2012 


Details on the parametrization of the MNIW distribution 
we use is available in the supplementary material, and it 
is given by the hierarchical model p(Q) =IVV(Q |f’Q,AQ) 
and p(A | Q) = MN{A \ 0,Q,V). For our problem, the 
most important is the second argument, the inverse row 
covariance V, a square matrix with the inverse spectral 


p(Q |xi:r,yi:r) = 

IW(Q \T + £Q,AQ + ((t>-'V{J: + Y)-^W~^)). (10) 


Given the prior Q, A can now to be sampled from (Wills 


et al. 2012 Eq. (43)): 


Algorithm 2 Particle Gibbs Markov kernel. 


Input: Trajectory xi-j[k], number of particles N 
Output: Trajectoryxi.7'[/c + 1] 

1: Sample ~ p{xi ), for i = -1. 

2: Set = xj [k]. 

3: For f = 1 to T 

4: Set le)'* =p(y, |X*‘>) = At(g(x*'>) | y„R), for i = 1. N. 

5: Sample with ^ for i = 1,...,N-1. 

Ji) 

6: Sample x^'^^ ~ At(f(xA ),Q), for / = 1. 

7: Set xjj_j = X(+i [fc], 

8: Sample with = j) oc 


9 

10 

11 


I **/*) = I f(xj^’),Q). 


Set Xi, 


(i) , r 

Xf+i), for l : 




End for 

Sample / with P(7 = i) cc and set xi. j[k + 1] = *1.7- 


p(A I Q,xi.r,yi.T') = 

A47V(A|'P(E-hV)“i,Q,(E-hV)“b. (H) 

3.3 Marginalizing the Hyperparameters 

Goncerning the sampling of the hyperparameters 6 , we 
note that we can easily evaluate the conditional distribu¬ 
tion p{6 I Xi.x,Q,A) up to proportionality as 

p(0 |xi:r,Q,A)oc 

p(e)p(Q I x,.,T,Q,e)p{A I xx,T,Q,A,e). ( 12 ) 
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To utilize this, we suggest to sample the hyperparame¬ 
ters by using a Metropolis-Hastings (MH) step, resulting 
in a so-called Metropolis-within-Gibbs procedure. 

































4 THEORETICAL RESULTS 

Our model and learning Algorithm[^inherits certain 
well-defined properties from the reduced-rank approxi¬ 
mation and the presented sampling scheme. In the first 
theorem, we consider the convergence of a series expan¬ 
sion approximation to the GP-SSM with an increasing 
number m of basis functions. As in ISolin and Sarkk^ 
| (20T4) , we only provide the convergence results for a 
rectangular domain with Dirichlet boundary conditions, 
but the result could easily be extended to a more general 
case. Proofs for all theorems are included in the supple¬ 
mentary material. 


y to the true data y® using the root mean square error 
(RMSE) and the mean log likelihood (LL): 


RMSE = 




and 


A Te /- 

' ® t=l 


1 ‘ 

LL = — ^logAf(y? I ]E[yf], V[yt]). 

t=i 


(13) 


(14) 


The source code for all examples is available via the first 
authors homepage. 


Theorem 4.1. The probabilistic model implied by the dy¬ 
namic and measurement models of the approximate GP-SSM 5.1 Synthetic Data 
convergences in distribution to the exact GP-SSM, when the 
size of the domain Q and the number of basis functions m 
tends to infinity. 


As a proof-of-concept already presented in Eigure[^ 
have T = 500 data points from the model 


we 


The above theorem means that in the limit any prob¬ 
abilistic inference in the approximate model will be 
equivalent to inference on the exact model, because the 
prior and likelihood models become equivalent. The 
benefit of considering the m-rank model instead of a 
standard GP, is the following: 

Theorem 4.2. Provided the rank-reduced approximation, 
the computational load scales as O(m^T) as opposed to 
0 {T^). 

Purthermore, the proposed learning procedure enjoys 
sound theoretical properties: 

Theorem 4.3. Assume that the support of the proposal in 
the MH algorithm covers the support of the posterior p{6 \ 
N > 2 in Algorithm^ Then the in¬ 
variant distribution of Algorithm^is p(xi.x,Q,A,9 \ yi-x)- 

Hence, Theorem |4.3| guarantees that our learning pro¬ 
cedure indeed is sampling from the distribution we ex¬ 
pect it to, even when a finite number of particles N >2is 
used in the Monte Garlo based Algorithm It is also 
possible to prove uniform ergodicity f or Algorithm [T 
as such a result exists for Algorithm ||Lindsten et al 

2M4) . 


Xf+i = tanh(2x,) +Wf, yt=Xt + et, 


(15) 


where Cf ~ J\f(0,0.1) and Wt ~ J\f(0,0.1). We in¬ 
ferred / and Q, using a GP with the exponentiated 
quadratic (squared exponential, parametrized as in |Ras- 
mussen and Williams 2006} covariance function with 
unknown hyperparameters, and Q ~ X>V(10,1) as pri¬ 
ors. In this one-dimensional case (x e [-L,L],L = 4), 
the eigenvalues and eigenfunctions are Aj = (Tr;7(2L))^ 
and (j)L)(x) = l/x[Lsm(ni(x +L)/(2L)). The spectral den¬ 
sity corresponding to the covariance function is Sgjw) = 
CT^V27r72 exp(-a>^£^/l). 

The posterior estimate of the learned model is shown 
in Pigure[^ together with the samples of the basis func¬ 
tion weights /LA The variance of the posterior distribu¬ 
tion of / increases in the regimes where the data is not 
exciting the model. 

As a second example, we repeat the numerical bench¬ 


mark example on synthetic data from Prigola et al. 
( 2014b| : A one-dimensional state space model Xf^i = 
Xf -i-1 + Wf, if Xf < 4, and Xf^i = -Axf - 1 - 21, if Xf > 4 with 
known measurement equation yt = Xf + Cf, and noise dis¬ 
tributed as Wt ~ J\f(0,1) and Cf ~ AfjO, 1). The model is 
learned from T = 500 data p oints, and evaluated w ith 


T = 


10 = 


data points. As in Prigola et al. (2014b|, a 


5 NUMERICAL EXAMPLES 

In this section, we will demonstrate and evaluate our 
contribution, the model § and the associated learning 
Algorithm using four numerical examples. We will 
demonstrate and evaluate the proposed method (includ¬ 
ing the convergence of the learning algorithm) on two 
synthetic examples and two real-world datasets, as well 
as making a comparison with other methods. 

In all examples, we separate the data set into train¬ 
ing data and evaluation data To evaluate the per¬ 
formance quantitatively, we compare the estimated data 


Matern covariance function is used (see, e.g., Section 
4.2.1 of [Rasmussen and Williams||200^ for details, in¬ 
cluding its spectral density). The results for our model 
with K = 200 MGMG iterations and m = 20 basis func¬ 
tions are provided in Table 


We also re-state two results from Prigola et al. (2014b >: 
The GP-SSM method by jPrigola et al. | |2013| (\^ich also 
uses particle MGMG for learning) and the variational 
GP-SSM by Prigola et al. (2014b|. Due to the compact 


writing in Prigola et al. ^2013[ 2014b|, we have not been 


able to reproduce the results, but to make the compar¬ 
ison as fair as possible, we average our results over 10 
runs (with different realizations of the training data). 
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Table 1: Results for synthetic and real-data numerical examples. 


Data / Method RMSE LL Train time [min] Test time [s] Comments 


Synthetic data: 


PMCMC GP-SSM Frigola ET AL. 2013| 

1.12 

-1.57 

Variational GP-SSM|Frigola et al. 21)14b 
Reduced-rank GP-SSM 

1.15 

-1.61 

1.10 

-1.52 

Damper modeling: 

Linear OE model (4th order) 

27.1 

N/A 

Hammerstein-Wiener {4th order) 

27.0 

N/A 

NARX (3rd order, wavelet network) 

24.5 

N/A 

NARX (3rd order, Tree partition) 

19.3 

N/A 

NARX (3rd order, sigmoid network) 

8.24 

N/A 

Reduced-rank GP-SSM 

8.17 

-3.71 

Energy forecasting: 

Static GP 

27.7 

-2.54 

Reduced-rank GP-SSM 

21.8 

-2.41 


547 

420 

As reported by 

Frigola et al. 

2014b 

2.1 

0.14 

As reported by 

Frigola et al. 

2'0'14b 

0.7 

0.18 

Average over lU runs. 




Number of MCMC samples K 


Figure 2: The (negative) log likelihood and RMSE for the 
second synthetic example, as a function of number of 
MCMC samples K, averaged (solid lines) over 10 runs 
(dotted lines). 


Our method was evaluated using the provided Matlab 
implementation on a standard desktop computei0 
The choice to use only K = 200 iterations of the learn¬ 
ing algorithm is motivated by Figure illustrating the 
'model quality' (in terms of log likelihood and RMSE) 
as a function of fC: It is clear from Figure that the 
model quality is of the same magnitude after a few hun¬ 
dred samples and after 10 000 samples. As we know 
the sampler converges to the right distribution in the 
limit K ^ oo, this indicates that the sampler converges 
already after a few hundred samples for this example. 
This is most likely thanks to the linear-in-parameter 
structure of the reduced-rank GP, allowing for the effi¬ 
cient Gibbs updates flO- 
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There is an advantage for our proposed reduced-rank 
GP-SSM in terms of LL, but considering the stochastic el¬ 
ements involved in the experiment, the different RMSE 
performance results are hardly outside the error bounds. 
Regarding the computational load, however, there is a 
substantial advantage for our proposed method, enjoy¬ 
ing a training time less only a third of the one by the vari¬ 
ational GP-SSM, which in turn outperforms the method 
by Frigola et al. | |2013| . 


lintel 17-4600 2.1 GHz CPU. 


5.2 Nonlinear Modeling of a Magneto- 
Rheological Fluid Damper 

We also compare our proposed method to state-of-the- 


art conventional system identification methods (Ljung 
1999|. The problem is the modeling of input-output 


behavior of a magneto-rheological fluid damper, intro¬ 
duced by |Wang et al. ( 2009} and used as a case study in 
the System Identification Toolbox for Mathworks Mat- 
lab ( [The Math Works, Inc. 2015) . The data consists of 
3 499 data points, of which 2000 are used for training 
and the remaining for evaluation, shown in Figure 3a 
The data exhibits some non-trivial dynamics, and as the 
T = 2 000 data points probably not contain enough in¬ 
formation to determine the system uniquely, a certain 
amount of uncertainty is present in the posterior. This is 
thus an interesting and realistic problem for a Bayesian 
method, as it possibly can provide useful information 
about the posterior uncertainty, not captured in classical 
maximum likelihood methods for system identification. 

We learn a three-dimensional model: 


Xf+i =fx(xi)-tf„(Mj)-tWt, (16a) 

yt = [0 0 IjXf-t-Cf (16b) 

where e IR^, ~ J\f(0,5), and Wf ~ 7V’(0,Q) with 
Q unknown. We assume a GP prior with an expo¬ 
nentiated quadratic covariance function, with separate 
length-scales for each dimension. We use m = 7^ = 343 
basis function^for and 8 for f„, which in total gives 
1 037 basis function weights and 5 hyperparameters 
0 to sample. 

The learned model was used to simulate a distribu¬ 
tion of the output for the test data, plotted in Figure 
Note how the variance of the prediction changes in dif¬ 
ferent regimes of the plot, quantifying the uncertainty in 
the posterior belief. The resulting output is also evalu¬ 
ated quantitatively in Table together with five state- 
of-the-art maximum likelihood methods, and our pro¬ 
posed method performs on par with the best of these. 

^Explicit expression for the basis functions in the multidimensional 
case is found in the supplementary material. 
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State distribution, and the four step ahead prediction 
density was computed. The data and the predictions are 
shown in Figure [Tb| 

As a sanity check, we compare to a standard GP, 
not explicitly accounting for any dynamics in the time- 
series. The standard GP was trained to the mapping 
from to ft+i’ the performance is evaluated in Ta¬ 
ble!^ From Tablethe gain of encoding dynamical be¬ 
havior in the model is clear. 

6 DISCUSSION 


(a) Fluid damper results 



Q O L _^^^^^^J^^^^_ 

^^JFMAMJJASOND 

Time [days] 


(b) Electricity consumption example 

Figure 3: Data (red) and predicted distributions (gray) 
for the real-data examples. It is interesting to note how 
the variance in the prediction changes between different 
regimes in the plots. 


The learning algorithm took about two hours to run on 
a standard desktop computer. 

The assumed model with known linear g and addi¬ 
tive form fj. -I- f„ could be replaced by an even more 
general structure, but this choice seems to give a sensi¬ 
ble trade-off between structure (reducing computational 
load) and flexibility (increasing computational load) for 
this particular problem. Our proposed Bayesian method 
does indeed appear as a realistic alternative to the max¬ 
imum likelihood methods, without any more problem- 
specific tailoring than the rather natural model assump¬ 
tion 116a 1. 


5.3 Energy Consumption Forecasting 

As a fourth example, we consider the problem of fore¬ 
casting the daily energy consumption in Sweden]^ four 
days in advance. The daily data from 2013 was used for 
training, and the data from 2014 for evaluation. The 
time-series was modeled as an autonomous dynamical 
system (driven only by noise), and a three dimensional 
reduced-rank GP-SSM was trained for this, with all func¬ 
tions and parameters unknown. To obtain the forecast, 
the model was used inside a particle filter to find the 

^Data from Svenska Kraftnat, available: http://www.svk.se/ 
aktorsportalen/elmarknad/statistik/ 


6.1 Tuning 


For a successful application of the proposed algorithm, 
there are a few algorithm-specific parameters for the 
user to choose: The number of basis functions m and 
the number of particles JV in PGAS. A large number 
of basis functions m makes the model more flexible 
and the reduced-rank approximation 'closer' to a non- 
approximated GP, but it also increases the computa¬ 
tional load. With a smooth covariance function k, the 
prior is in practice ^ 0 for moderate j, and m can 
be chosen fairly small (as a rule of thumb, say, 6-15 per 
dimension) without making a too crude approximation. 
In our experience, the number of particles N in PGAS 
can be chooses fairly small (say, 20), without affecting 
the mixing properties of the Markov chain heavily. This 
is in accordance to what has been reported in the litera¬ 
ture by Lindsten et al. ||2014}. 


6.2 Properties of the Proposed Model 

We have proposed to use the reduced-rank approxima¬ 
tion of GPs by Solin and Sarkka | |2014 1 within a state 
space model, to obtain a GP-SSM which efficiently can 
be learned using a PMGMG algorithm. As discussed 
in Section and studied using numerical examples in 
Section the linear-in-the-parameter structure of the 
reduced-rank GP-SSM allows for a computationally effi¬ 
cient learning algorithm. However, the question if a sim¬ 
ilar performance could be obtained with another GP ap¬ 
proximation method or another learning scheme arises 
naturally. 

Other GP approximation methods, for example 
pseudo-inputs, would most likely not allow for such effi¬ 
cient learning as the reduced-rank approximation does; 
unless closed-form Gibbs updates are available (requir¬ 
ing a linear-in-the-parameter structure, or similar), the 
parameter learning would have to resort to Metropolis- 
Hastings, which most likely would give a significantly 
slower learning procedure. For many GP approximation 
methods it is also more natural to find a point estimate 
of the parameters (the inducing points, for example) us¬ 
ing, for example, EM, rather than inferring the parame¬ 
ter posterior, as is the case in this paper. 
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The learning algorithm, on the other hand, could 
probably be replaced by some other method also in¬ 
ferring (at least approximately) the pos terior distribu- 
tion of the parameters, such as SMC^ (Chopin et al. 
or a variational method. However, to maintain 


efficiency, the method has to utilize the linear-in-the- 
parameter structure of the model to reach a computa¬ 
tional load competitive with our proposed scheme. Such 
an alternative (however only inferring MAP estimate of 
the sought quantities) could possibly be the method by 
Kokkala et al. | |2014[ |. 


6.3 Conclusions 

We have proposed the reduced-rank GP-SSM ([^, and 
provided theoretical support for convergence towards 
the full GP-SSM. We have also proposed a theoretically 
sound MGMG-based learning algorithm (including the 
hyperparameters) utilizing the structure of the model ef¬ 
ficiently. 

By demonstration on several examples, the compu¬ 
tational load and the modeling capabilities of our ap¬ 
proach have been proven to be competitive. The com¬ 
putational load of the learning is even less than in the 
variational sparse GP solution provided by |Frigola et al. 
| |2014b[ |, and the performance in challenging input- 
output modeling is on par with well-established state- 
of-the-art maximum likelihood methods. 


6.4 Possible Extensions and Further Work 


A natural extension for applications where some do¬ 
main knowledge is present, is to let the model include 
some functions with an a priori known parametrization. 
The handling of such models in the learning algorithm 
should be feasible, as it is already known how to use 
PGAS for such models ( Lindsten et al.[ 2014| . Further, 
the assumptions of the XW prior of Q |[^ are possible to 
circumvent by using, for example, MH, at the cost of an 
increased computational load. The same holds true for 
the Gaussian noise assumption in |[^. 

Another direction for further work is to adapt the pro¬ 
cess to be able to sequentially learn and improve the 
model when data is added in batches, by formulating the 
previously learned model as the prior to the next itera¬ 
tion of the learning. This could probably be useful in, 
for example, reinforcement learning, along the lines of 
Deisenroth et al.|(|2015|. 


In the engineering literature, dynamical systems are 
mostly defined in discrete time. An interesting approach 
to model the continuous-time counterpart using Gaus¬ 
sian processes is presented by Ruttor et al. ( 2013| . A de¬ 
velopment of the reduced-rank GP-SSM to continuous 
time dynamical models using stochastic Runge-Kutta 
methods would be of great interest for further research. 
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Supplementary material 

This is the supplementary material for 'Computationally Efficient Bayesian Learning of Gaussian Process State 
Space Models' by Svensson, Solin, Sarkka and Schon, presented at AISTATS 2016. The references in this document 
point to the bibliography in the article. 


1 Proofs 


Proof of Theorem 4.1 Let us start by considering the GP approximation to f(x), x e [-Li,Li] x ••• x By 

Theorem 4.4 of Solin and Sarkka ( 2014| , when domain size inf; L, ^ oo and the number of basis functions m ^ oo, 
the approximate covariance function k^(x,x') converges point-wise to k(x,x'). As the prior means of the exact and 
approximate GPs are both zero, the means thus converge as well. By similar argument as is used in the proof of 
Theorem 2.2 in Sarkka and Pich^ ( 2014| it follows that the posterior mean and covariance functions will converge 
point-wise as well. 

Now, consider the random variables defined by 


Xf+l =f(Xt)-tW;, (17) 

Vl =fm(X()-tW;, (18) 


where f„, is an m-term series expansion approximation to the GP. It now follows that for any fixed X; the mean and 
covariance of X;+i and Xf+j coincide when L;,m ^ oo. However, because these random variables are Gaussian, the 
first two moments determine the whole distribution and hence we can conclude that X;^i ^ x^+j in distribution. 
Lor the measurement model we can similarly consider the random variables 

yi=g(xf) + ef, (19) 

y* =gm(x;)-tet, (20) 


With similar argument as above, we can conclude that the approximation converges in distribution. □ 

Proof of Theorem ^4.2\ Provided the reduced-rank approximation of the Gram matrix, the reduction in the compu¬ 
tational load directly follows from application of the matrix inversion lemma. □ 


Proof of Theoreni ^4. 3\ Using fundamental properties of the Gibbs sampler (see, e.g., Tierney| | |1994| ), the claim holds 
if all steps of Algorithm are leaving the right conditional probability density invariant. Step is justified by 
Lindsten et al.|(2014| (even for a finite N), and step [4]{^ by [Wills et al. | |2012| . Purther, step can be seen as a 
Metropolis-within-Gibbs procedure (Tierney) 1994}. □ 


2 Details on Matrix Normal and Inverse Wishart distributions 


As presented in the article, the matrix normal inverse Wishart (MNIW) distribution is the conjugate prior for state 
space models linear in its parameters A e and Q e Wills et al. ( 2012| . The MNIW distribution can be 

written as MM{A,Q \ M,'V,£,A) = MAf{A \ M,Q,V) xX>V(Q | £,A), where each part is defined as follows: 

• The pdf for the Inverse Wishart distribution with £ degrees of freedom and positive definite scale matrix 
AelR"’'”: 

IW(Q\£,A)= ' exp --tr(Q-iA) (21) 


with r,. 


2^”/2r„(^/2) 

being the multivariate gamma function. 


• The pdf for the Matrix Normal distribution with mean M e right covariance Q e IR" 

V e 

|vp^2 


MN(A I M, Q, V) = exp (-itr ((A - M)Tq-i (A - M)V)) 


and left precision 
( 22 ) 


To sample from the MN distribution, one may sample a matrix X e ]R"^™ of i.i.d. Af(0,1) random variables, and 
obtain A as A = M -i- chol(Q)Xchol(V), where chol denotes the Gholesky factor (V = chol(V)chol(V)^). 








































3 Eigenfunctions for Multi-Dimensional Spaces 


The eigenfunctions for a rf-dimensional space with a rectangular domain [-Li,Li] x ••• x used in Exam- 

ple|5.2|and Example|5.3| are on the form 


1 

(hdi’-’id) = 1 f ——sin 

Uv4 


njk(xk + Lk) 

2U 


with A 


h--U 


d 

L 

k=l 


Tt]k 

2Lk 


(23) 


Note how this for d = 1 reduces to the univariate case presented in Section [13 
Section 4.2 in Solin and Sarkka | |2014| . 


Eor further details we refer to 


4 Provided Matlab Software 


The following Matlab files are available via the first authors homepage: 


File 

Use 

Comments 

synthetlc_example_1.m 
synthetlc_example_2.m 
clamper. m 

First synthetic example (including Figure^ 
Second synthetic example 

MR damper example 

For other results, see |The| 

energy_forecast.m 
iwlshpdf.m 
mvnpdf_log.m 
systematic_resampling.m 

Energy consumption forecasting example 
Implements |21) 

Logarithm or normal distribution pdf 
Systematic resampling (Stepj^ Algorithmj^ 

|MathWorks, Inc.||201^ 

All files are published under the GPL license. 
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